Big Data Computing: Batch Processing

Explore batch processing frameworks for large-scale data analysis. Compare Hadoop MapReduce's simplicity with Spark's in-memory computing advantages. Learn how batch processing powers offline data warehouses, analytics, and ML training - perfect for enterprises needing high-throughput processing of historical data. Discover key architectures and use cases.

2025-09-09

In the previous article, [Big Data Storage: HDFS](https://xx/Big Data Storage:HDFS), we discussed the design principles of HDFS and its continuous architectural optimizations in practice. With its distributed, scalable, and fault-tolerant features, HDFS has gradually become the cornerstone of big data storage.

This article focuses on the computing aspect from [Deconstructing Big Data: Storage, Computing, and Querying](https://xx/Deconstructing Big Data:Storage, Computing, and Querying).
Big data computing can be divided into offline computing and real-time computing, where offline computing is also known as batch processing.

Here, we will first focus on batch processing, exploring its principles, architecture, frameworks, and application scenarios.

What Is Batch Processing?

Batch Processing is a big data computing method in which data is collected in batches, processed in bulk, and the results are output in a single run. It is suitable for scenarios involving large-scale data, complex computation logic, and a high tolerance for latency — such as daily or monthly report generation.

The core concept of batch processing is to trade latency for throughput: instead of handling records one at a time, the system accumulates data over a window (an hour, a day, a month) and then processes the accumulated batch in a single run.

Key characteristics include high throughput, high latency (results are available only after the whole batch finishes), bounded input data of a known size, and resource usage concentrated in scheduled processing windows.

Batch Processing Architecture

Following these principles, a batch processing system typically consists of several layers: an ingestion layer that collects raw data, a storage layer (such as HDFS) that holds it durably, a computing layer (such as MapReduce or Spark) that runs the batch jobs, and an output layer that writes results to downstream storage for analysis.

Batch Processing Frameworks

Two widely used batch processing frameworks are MapReduce and Spark.

The MapReduce programming model splits jobs into two main phases:

  1. Map – Transforms input data into a set of key/value pairs.
  2. Reduce – Aggregates values associated with the same key and outputs the final result.
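The two phases above can be sketched in plain Python. This is a toy word-count simulation of the programming model, not the Hadoop API: the `shuffle` step stands in for the grouping that the framework performs between the Map and Reduce phases.

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in every input line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle: group all values by key; in a real cluster the framework
# does this between the Map and Reduce phases, moving data over the network.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the values associated with each key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data batch processing", "big data storage"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'batch': 1, 'processing': 1, 'storage': 1}
```

In a real deployment each phase runs in parallel across many machines, with the shuffle writing intermediate results to disk — which is exactly the overhead discussed next.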

However, complex jobs often require chaining multiple MapReduce jobs. Each job reads its input from disk and writes its output back to disk on completion, and jobs must be repeatedly started and shut down, adding significant I/O and scheduling overhead.
Spark addresses these issues by using in-memory processing and DAG scheduling, allowing multiple MapReduce-like operations to be executed within a single Spark job.

Spark job execution flow:

  1. Job Submission – Submitted via Spark SDK or command line.
  2. DAG Construction – Built based on Spark operators.
  3. Job Partitioning & Scheduling – The DAG is split into stages for scheduling.
  4. Data Processing – Data shuffling occurs only when global aggregation or join operations are needed.
  5. Output Results – Computation results are written to the target storage.
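The flow above can be illustrated with a toy lazy-evaluation sketch. This is not Spark's actual API — `MiniRDD` is a hypothetical stand-in — but it shows the key idea: transformations only build a lineage (the DAG), nothing executes until an action like `collect` is called, and a whole chain of operations runs as one in-memory job.

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: transformations build a plan,
    nothing runs until an action is called."""

    def __init__(self, data=None, parent=None, op=None):
        self._data = data      # source data (only on the root node)
        self._parent = parent  # upstream node in the DAG
        self._op = op          # transformation applied lazily

    def flat_map(self, f):
        return MiniRDD(parent=self,
                       op=lambda items: [y for x in items for y in f(x)])

    def map(self, f):
        return MiniRDD(parent=self, op=lambda items: [f(x) for x in items])

    def reduce_by_key(self, f):
        # A shuffle boundary: values are grouped by key before aggregation.
        def op(items):
            groups = {}
            for k, v in items:
                groups[k] = f(groups[k], v) if k in groups else v
            return list(groups.items())
        return MiniRDD(parent=self, op=op)

    def collect(self):
        # Action: walk the lineage back to the source and execute once,
        # keeping intermediate results in memory instead of on disk.
        items = self._parent.collect() if self._parent else self._data
        return self._op(items) if self._op else items


# A word count that would take two chained MapReduce jobs' worth of
# disk round-trips runs here as a single in-memory pipeline.
rdd = MiniRDD(["big data", "big batch"])
result = (rdd.flat_map(str.split)
             .map(lambda w: (w, 1))
             .reduce_by_key(lambda a, b: a + b)
             .collect())
print(result)  # [('big', 2), ('data', 1), ('batch', 1)]
```

Real Spark additionally splits this lineage into stages at shuffle boundaries (step 3 above), so that narrow transformations like `flat_map` and `map` are fused and executed together within one stage.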

Application Scenarios

Batch processing still holds a critical position in enterprise data architectures. Common use cases include offline data warehouse construction (daily or monthly ETL and report generation), large-scale analytics over historical data, and dataset preparation for machine learning training.

Limitations & Challenges

Despite its importance, batch processing faces several challenges: high end-to-end latency means results always lag behind the freshest data; jobs scheduled in fixed windows create bursty resource demand on the cluster; and scenarios that require second-level freshness simply cannot be served by batch jobs.

Conclusion

In the big data ecosystem, computing is a crucial link between data storage and application. Batch processing remains the backbone for offline data warehouses, analytics, and machine learning training workloads.
While real-time processing is gaining momentum, batch processing is still irreplaceable.

In the next article, we will dive into big data real-time processing, exploring how it compensates for batch processing’s shortcomings and the unique challenges it faces.